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Summary. Estimating a covariance matrix efficiently and discovering its structure are im- 
portant statistical problems with applications in many fields. This article takes a Bayesian 
approach to estimate the covariance matrix of Gaussian data. We use ideas from Gaussian 
graphical models and model selection to construct a prior for the covariance matrix that is a 
mixture over all decomposable graphs, where a graph means the configuration of nonzero off- 
diagonal elements in the inverse of the covariance matrix. Our prior for the covariance matrix 
is such that the probability of each graph size is specified by the user and graphs of equal 
size are assigned equal probability. Most previous approaches assume that all graphs are 
equally probable. We give empirical results that show the prior that assigns equal probability 
over graph sizes outperforms the prior that assigns equal probability over all graphs, both in 
identifying the correct decomposable graph and in more efficiently estimating the covariance 
matrix. The advantage is greatest when the number of observations is small relative to the di- 
mension of the covariance matrix. Our method requires the number of decomposable graphs 
for each graph size. We show how to estimate these numbers using simulation and that the 
simulation results agree with analytic results when such results are known. We also show how 
to estimate the posterior distribution of the covariance matrix using Markov chain Monte Carlo 
with the elements of the covariance matrix integrated out and give empirical results that show 
the sampler is much more efficient than current methods. The article also shows empirically 
that there is minimal change in statistical efficiency in using the mixture over decomposable 
graphs prior for estimating a general covariance compared to the Bayesian estimator by Wong 
et al. (2003), even when the graph of the covariance matrix is nondecomposable. However, 
our approach has some important computational advantages over that of Wong et al. (2003). 
Finally, we note that both the prior and the simulation method to evaluate the prior apply gen- 
erally to any decomposable graphical model. 
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1. Introduction 



Estimating a covariance matrix efficiently is an important statistical problem with many 
applications, such as multivariate regression, cluster analysis, factor analysis, and discrimi- 
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nant analysis; see, for example, Mardia et al. (1979). Such applications are used in the fields 
of Business, Engineering, and the physical and social sciences. It is also of considerable in- 
terest to understand the graphical structure of the covariance matrix because it is directly 
interpretable in terms of the partial correlations of the underlying multivariate distribution. 
By the graph of the covariance matrix we mean the pattern of nonzero off diagonal elements 
in the inverse of the covariance matrix, also called the concentration matrix (see Lauritzen 
1996, Chapter 5). Estimating a covariance matrix efficiently and understanding its graphi- 
cal structure are difficult estimation problems because the number of unknown parameters 
in the covariance matrix increases quadratically with dimension and by the requirement 
that the estimate of the covariance matrix is positive definite. 

There is a large literature of methods that use shrinkage or Bayesian models to improve on 
the maximum likelihood estimator of the covariance matrix. See, for example, Dempster 
(1969), Dempster (1972), Efron and Morris (1976), Yang and Berger (1994), Chiu et al. 
(1996), Giudici and Green (1999), Barnard et al. (2000), Wong et al. (2003) and Liechty 
et al. (2004). The simulation studies in Yang and Berger (1994) and Wong et al. (2003) 
show that considerable gains in efficiency are possible. 

Dempster (1972) advocates a covariance selection approach to estimate a covariance matrix 
more efficiently, by which he means setting to zero some of the off-diagonal elements of 
the concentration matrix. His idea is that a more parsimonious model will give greater 
efficiency. However, the selection of which elements to set to zero is difficult even for 
moderate dimensions because a p x p concentration matrix has p(p — l)/2 distinct off- 
diagonal entries and there are 2 p ( p ~ 1 ^ 2 possible graphs associated with it. Drton and 
Perlman (2004) give a model selection approach based on simultaneous confidence intervals 
to determine which partial correlations are zero. The simultaneous confidence intervals are 
based on large sample theory and become large when p is moderate to large. Drton and 
Perlman (2004) do not attempt to estimate the covariance matrix based on their selected 
graph. 

A number of articles take a Bayesian approach to covariance selection. For the case of 
decomposable graphs, Dawid and Lauritzen (1993) introduces a conjugate prior for the 
covariance matrix called the hyper inverse Wishart distribution. Giudici (1996) uses a prior 
for the covariance matrix that is a mixture of fixed parameter hyper inverse Wishart priors 
over decomposable graphs and calculates the marginal likelihood for each decomposable 
graph. The marginal likelihood is used to calculate the posterior probability of each graph. 
This gives an exact solution for small examples, but for p greater than approximately 8 the 
number of graphs is prohibitively large. 

Roverato (2000) shows that the hyper inverse Wishart prior for the covariance matrix is 
equivalent to a constrained Wishart prior for the concentration matrix. Although it is 
straightforward to define a constrained Wishart prior for general graphs, such distribu- 
tions have normalizing constants that are not available analytically unless the graph is 
decomposable. Roverato (2002), Dcllaportas et al. (2004) and Atay-Kayis and Massam 
(2005) propose efficient simulation and importance sampling methods for estimating the 
normalizing constants for the nondecomposable graphs. The normalizing constants are 
used to examine a small number of graphs and select those that that have the highest 
marginal likelihood or posterior probability, rather than to estimate the covariance matrix 
by averaging over graphs. However, such an approach seems unsuitable as the basis of a 
Markov chain Monte Carlo sampling scheme when p is moderate to large because there are 
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2p(p-!)/ 2 possible graphs with only a small fraction of them being decomposable. 

Giudici and Green (1999) give a MCMC approach that can deal with large values of p. Their 
method applies to a hierarchical model with a hyper inverse Wishart prior for the covariance 
matrix conditional on a decomposable graph. They use reversible jump Metropolis-Hastings 
methods to generate the covariance matrix and other parameters. Their method has a local 
computation property that only requires Cholesky decompositions of the submatrix of the 
covariance matix corresponding to a clique of the graph. Brooks et al. (2003) modify the 
reversible jump MCMC proposal of Giudici and Green (1999) and give empirical results to 
show this improves the convergence rate. 

Wong ct al. (2003) also use MCMC methods to select which off-diagonal element to set 
to zero. They use reversible jump Metropolis-Hastings methods to generate the inverse 
covariance matrix and other parameters. The main difference between Giudici and Green 
(1999) and Wong et al. (2003) is that Wong et al. (2003) do not constrain the possible graphs 
to be decomposable. Wong et al. (2003) use a prior with normalizing constants based on 
graph size to avoid having to calculate normalizing constants for each nondecomposablc 
graph. They also need to run a separate MCMC to estimate the normalizing constants for 
each graph size. 

For longitudinal data, Smith and Kohn (2002) factor the concentration matrix using a 
Cholesky decomposition and carry out variable selection on the strict lower triangle of the 
Cholesky to obtain parsimony. Their approach is attractive when there is some natural 
ordering of the observation vector, but there are two potential drawbacks to the Cholesky 
approach when such a natural ordering does not exist. First, different orderings of the vari- 
ables can yield different estimates of the covariance matrix. Second, under some orderings 
the Cholesky factor may be quite full even if the concentration matrix is sparse. 

In this paper we consider Bayesian estimation of decomposable covariance selection models, 
also known as decomposable graphical Gaussian models. Our article makes the following 
contributions. First, we propose a prior for the covariance matrix such that the probability 
of each graph size is specified by the user, whereas most previous approaches, e.g. Giudici 
and Green (1999), assume that all graphs are equally probable. We show by simulation that 
the prior that assigns equal probability over graph sizes outperforms the prior that assigns 
equal probability over all graphs, both in identifying the correct decomposable model and 
in estimating the covariance matrix more efficiently. This advantage is greatest when the 
number of observations is small relative to the dimension of the covariance matrix. We 
also show by simulation that there is minimal change in statistical efficiency in using our 
mixture prior compared to the estimator of Wong et al. (2003), even when the graph of the 
covariance matrix is nondecomposablc. 

Our prior requires knowing the number of decomposable graphs for each graph size. The 
second contribution of the article is to give a MCMC method for estimating these counts, 
and to show that the counts obtained by the simulation method agree with analytic results 
when such results are known. 

Our third contribution is to use the marginal likelihood results in Giudici (1996) to derive 
a reduced conditional MCMC sampler for decomposable graphical models, where the co- 
variance matrix is integrated out of all conditional distributions and is not generated in 
the MCMC. Our approach does not require reversible jump Metropolis-Hasting methods 
and has the local computation properties of the Giudici and Green (1999) approach, so the 
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computational complexity for one iteration of our approach is similar to that of Giudici and 
Green (1999). We give empirical results that show our sampler produces iterates that have 
much less autocorrelation compared to the methods in Brooks et al. (2003). We also show 
that our sampler has a faster convergence rate than the Wong et al. (2003) approach. Jones 
et al. (2005) uses a version of the marginal likelihood MCMC approach described in this 
paper that does not involve hyperparameters. This approach is used to find the graph with 
maximum posterior probability and the results are compared to stochastic search. 

The results in our article suggest that at present there is no 'best' method for estimating 
Gaussian covariance selection models. While the method of Wong et al. (2003) works in 
principle for all graphs, the convergence of their MCMC simulation can be slow if the 
true graph has full subgraphs of size 5 or larger because Wong et al. (2003) generate the 
elements of the concentration matrix one at a time. On the other hand the sampling 
scheme for decomposable graphs presented in our article is extremely efficient because the 
concentration matrix is integrated out and is an attractive alternative to the Wong et al. 
(2003) model for high dimensional graphs that are likely to have substantial full subgraphs. 
There are two other advantages of the decomposable prior considered in our article. The 
first is that there is a separate normalizing constant for each decomposable graph, whereas 
Wong et al. (2003) have a normalizing constant for each graph size. The second is that 
the Wong et al. (2003) model does not at present allow for hyperparameters in the prior. 
For example, using an equicorrelatcd prior as in Giudici and Green (1999) is not at present 
feasible with the approach of Wong et al. (2003) . 

The paper is organized as follows. Section 2 briefly introduces graphical Gaussian mod- 
els. Section 3 describes our Baysian covariance selection model and Section 4 describes our 
MCMC approach to estimating this model. Section 5 compares the prior that assigns equal 
probablity to each graph size to the prior that assigns equal probability to each decom- 
posable graph. Section 6 shows how to estimate the number of decomposable graphs for 
each size by simulation. Section 7 compares the efficiency of our sampler to the reversible 
jump approach in Brooks et al. (2003). Section 8 gives a Bayesian analysis of a multivariate 
dataset on physical measurements. Section 9 compares the prior that assigns equal prob- 
ability to each graph size to the prior of Wong et al. (2003). There are two appendices. 
The first gives the proofs of the results in the paper. The second gives a computationally 
efficient expression for evaluating the ratio of normalizing constants from Section 4.1. 



2. Background on Gaussian graphical models 

Before explaining our Bayesian covariance selection model we provide some background on 
Gaussian graphical models. Further details on such models are available in Dawid and 
Lauritzen (1993) and Chapters 2, 3 and 5 of Lauritzen (1996). 

Let g = (V,E) be an undirected graph with vertices V — {l,...,p} and set of edges 
E <ZV xV. For a square matrix A we write A > to denote that A is positive definite. Let 
M + (g) be the set of p x p matrices Q satisfying fi > and Qjj = for all pairs (i, j) £ E. 

For a given p x p covariance matrix S, we define the graph of S, g = g(T.) = (V,E), as 
follows. Let = Let V = {1, . . . ,p} and define E = i ^ j such that Ojj ^ 0}. 

Thus the graph g = g(E) gives the configuration of nonzero off-diagonal elements in Q. 
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We say that an m x m matrix A > has an inverse Wishart (IW) density with S > 
degrees of freedom and scale matrix denoted as A ~ IW(m, 5, <&), if the density of A is 



p(A|J,$) 



r m (|) 



(i+m + l) 



etr [ -^A' 1 



(1) 



where etr(A) = cxp(trace(A)) and for a > 



(m-l) 



r ro (a) = 7r ro ( ro - 1 )/ 4 nr(a-(- 



-)) 



is the multivariate gamma function (Muirhead, 1982, p. 113). 

Lauritzen (1996, Definition 2.3, p. 8) defines a decomposable graph and we refer to a covari- 
ance matrix S as decomposable if its graph g = <?(E) is decomposable. 

Suppose that g is a decomposable graph and let C\ , . . . , Ck be a perfect sequence of the 
cliques of g. Let Hj = C\ U . . . U Cj be the history of the sequence and let Sj = ifj-i n Cj 
be the separators for j = 2, . . . , k. For any matrix M and subset of vertices B, use Mbb to 
denote the symmetric submatrix of M which is formed by taking every corresponding entry 
Mij for which the vertices {V^, Vj} € B. Using the parameterization of Dawid (1981), we 
say E has a hyper inverse Wishart (HIW) distribution, with hyperparameters (5, <&) denoted 
by E - HIW(g, 6, $), if for E" 1 e M+(g) 



Y[p(2 CiCi \5, §c iCi ) 



P(S|J,$,5) 



ilp(Ss (S< |J,$s 4 s 4 ) 



(2) 



where (5 > 0, $ > 0, and the density is with respect Lebesgue measure on the elements of 
E corresponding to edges of g. 

In (2), the terms p{Y, Ci c i \5,^c i C i ) denote the IW densities E Cl c, ~ IW(\Ci\,5+ |C,| - l,$c s cj 
given by 



\Ci\ 



etr 



, (3) 



where |C,| denotes the cardinality of the clique Cj, and the terms p(Es,s 4 |5, $s 4 s ( ) are 
defined similarly. Note that the expression in (2) is invariant to the choice of perfect 
sequence. 

From (1) - (3), the normalizing constant for the HIW distribution is 

k 

^ 2 ) 



n 



h(g,6,$) 



2 



n 



|S 4 | 



(itl*l=l)- 



(4) 
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3. Bayesian Covariance Selection models 

3. 1. Likelihood and hierarchical structure 

Suppose we have independent observations 

yt~N(ii,E), t = l,-",n, (5) 

where y t is p x 1. Let y — (yi, • • • , y n ) be the data. We use a hierarchical prior for \i and 
E of the form 

p(fi, £, $, ,5, <?) = p( M |E, $, ,5, </)p(E|*, <5, «/)p($| J, <?)p(%)p(<?), 

where each of the terms on the right is discussed below. In our article we assume that 
p(fj,\T,,^,5,g) oc constant, as our focus is on priors for E. The prior for E depends on its 
graph g, the p x p matrix $ and the scalar 5, and is is discussed in Section 3.2. Section 2 
defines the graph of E as the configuration of nonzero off diagonal elements in E _1 . The 
prior for $ is discussed in Section 3.3 and the prior for the graph g is discussed in Section 3.4. 

In the article we restrict the graph of E to be decomposable, so that the prior for E is 
a mixture over all decomposable graphs. We explain in Section 2 that this is equivalent 
to the prior for Q = E _1 being a mixture over all Wishart distributions constrained to 
decomposable graphs. 

3.2. Prior for E 

We use the HIW prior (2) for E|$, 6, g, which allows E to be integrated out in the sampling 
scheme described in Section 4. Thus, our prior 

p(d£|*, J) = £>(d£|ff, 

9 

is a mixture of HIW distributions over all decomposable graphs g As discussed in the 
introduction, Roverato (2000) shows that the inverse of a HIW random matrix has a Wishart 
distribution, subject to the constraints imposed by the corresponding graph. Thus 

p(dS|*,J) = J2p(dX\g,$,6)p(g) 

9 

is a mixture of of constrained Wishart distributions over all decomposable graphs. 

In our article we set the degrees of freedom parameter 5 to 5 as such a value of 6 gives a 
suitably noninformative prior for E. 

3.3. Prior specification for $ and its parameters 

We consider the following three specifications for the hyperparameter and refer to them 
as the hyperprior forms of <j>: 



(a) $ = tI, t > where / is the p x p identity matrix. 
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(b) <j> = r(pj + (1 — p)I), t > where J is the p x p matrix of ones and p is a correlation 
coefficient that needs to be in the open interval (— 1), 1) for <f> to be positive 
definite. This specification is used by Giudici and Green (1999) and is called the 
equicorrelated version of $ because $>n — t and = rp for i ^ j. 

(c) $ = TS y /(n - 1), where t > 0, 

n 

Sy^ivt-vXvt-vY, (6) 
t=i 

and y is the mean of the yt- 

We motivate the choice of $ in two ways. First, by integrating p out of p(y\p, £), 
with constant, we obtain 

p^lSJalSI-^-^etr^-^E- 1 ). (7) 

Suppose g is a decomposable graph. If we take p(E\g) oc p(y\T,) 1 ^ n ~ 1 \ then from 
(7) and equation (3) of Giudici (1996), we can write p(E\g) in the form (3) with 
* = S v /(n-l). 

A second motivation for this choice of $ is to note that if S <~ HIW(p, 5, <£>), then 
S(Scc) = &cc/{5 — 2) for any clique C — d or separator C = S, in (2). Since 
{S y )cc /{n— 1) is an unbiased estimator of £cc, this suggests taking $ oc S y /(n — 1). 

We assume in all cases that r is uniform on the interval [0, T] where V is large, e.g. T = 10 10 , 
and in the equicorrelated case that p is uniform on the open interval (— — 1), 1). 



3.4. Prior for g 

We first define notation for the edge indicators of a graph g. Let 

l if(U)e£ (g) 

otherwise 

and let e_jj = {e^; : (fc, I) ^ (i,j)}- Note that any graph g — (V, E) can be unambiguously 
written as g = (ey, e~ij)- 

For a given graph g = g(T,), let the number of edges, or the size of g, be given by 

size(g) = ^ e ij ( 9 ) 

i<j 

i.e. size(g) is the number of nonzero elements in the strict upper triangle of O, and size(g) < 
r = p(p-l)/2. 

Because of the theoretical and practical difficulty in calculating for a given p the exact 
number of decomposable graphs, or the number of graphs of a given size, most of the 
literature for both decomposable and general models takes the prior for g as uniform over 
all the relevant graphs; see, for example, Giudici and Green (1999), Dcllaportas and Forster 
(1999), Geiger and Hcckcrman (2002), Giudici and Castelo (2003), Roverato (2002), and 
Atay-Kayis and Massam (2005). 
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Such a prior favours any class of graphs with many members over a class with few members, 
and favours middle sized graphs over both very large and very small sized graphs. 

Let A Pt k denote the number of graphs of size k. We specify the prior for a graph g hierar- 
chically as follows. 

p(g\size(g) = k) = - — , 

so that all graphs of a given size are equally likely. We now specify the prior for the size of 
a graph. One choice is 

p(size = k) oc A p ^, 

which means that 

P(g) =p{g,size(g)) = p(g\size(g))p(size(g)) oc constant, 
giving the uniform prior for g. A more flexible prior is of the form 



p{size = k\^)= Qv fe (l-VO r - fe , 



where we interpret tp as the probability that any two vertices have a common edge. We 
could then put a prior on ip. Suppose we take the prior for ip as a beta with parameters a 
and b, i.e. 

V> a_1 (i- V) 6 " 1 



p(ip) = 



p(size = k) 



B{a,b) 

B(a + k,r-k + b) 



kj B(a,b) 



P(g) = P(g\size(g))p(size(g)) (10) 
B(a + size(g),r- size(g) + b) 



Then, 



and 



size(g)J A P:Slze(g) B(a,b) 

where B(a, b) is the beta function. We could now also put a prior on a, b. In our article we 
take ip uniform so that a = b = 1 , which means that 

p( Si ze = k) = j-^- an d p(g) = - - . 

That is, the size of each graph has equal probability, and the probability of a graph of size 
k conditional on size = k is uniform. However our framework is more flexible than this. 

We call this the size based prior for g and compare results against those using a uniform 
prior. 

The size based prior makes it easier to discover sparse and full graphs when n/p is small. 
The counts A p ,k are not available in the literature. Section 6 gives results to calculate a 
subset of them analytically, and shows how to evaluate the rest by simulation. 
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4. Posterior inference and Markov chain Monte Carlo sampling 



We use Markov Chain Monte Carlo (MCMC) simulation to obtain all posterior distributions. 
The simulation involves the generation of the graphs g and the parameters in $ but not S 
and \x which are integrated out. Thus, our sampling scheme is said to generate from reduced 
conditionals and is therefore expected to be more efficient than the sampling schemes in 
Giudici and Green (1999) and Wong et al. (2003) that generate £ as part of their sampling 
scheme. 

We note that iterates of \i and S can also be generated in conjunction with the simulation, 
but such iterates of /x and S do not have any influence on the convergence properties or 
dependence structure of the reduced conditional simulation. 

The following theorems arc useful in evaluating the conditional distributions required in the 
simulations. The first theorem gives a conjugate prior property of the HIW distribution. 

Let S y be defined by (6) and define 

$* = $ + S y and 5* = S + n - 1 . (12) 

Theorem 1. (Dawid and Lauritzen, 1993) For the Bayesian model specified by (2) and (7) 

Z\y,5,$,g~HIW(g,6*,$*). 

Proof. See Dawid and Lauritzen (1993) or Appendix A. 

The next theorem gives an expression for the marginal likelihood. 

Theorem 2. (Giudici, 1996) For the Bayesian model specified by (2) and (7), 

P(y\^ 9 ) = (2n)-^-^^^ (13) 

PROOF. See Giudici (1996) or Appendix A . 
4. 1 . Sampling the graphs g 

We sample the graphs g by generating the edge indicators one at a time, conditional on 5, $ 
and e_y = {e«, (k, I) ^ k < 1} using the following MH sampling scheme. 

Using the notation of Section 3, let g c = (V, E c ) be the current graph of X, which is 
decomposable by construction with edge indicators {e c kl : 1 < k < I < p}. 

We choose a pair at random and suppose that g — (ey, ef.^-) is decomposable for 

both dj — and = 1. We use the legal edge addition and deletion characterizations of 
Giudici and Green (1999) and Frydenberg and Lauritzen (1989) respectively to ensure this. 
Otherwise we choose a new pair (i,j). 

Set the proposal graph as g p (conditional on g c ) as g — (e?-, el^ ) where e?. = 1 — e\y This 
means that the proposal density for is q g (a\b, e'Lij) where a and b are each either or 1, 
and q g (a = 1 — b\b, e^-) = 1. 
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The MH acceptance probabilty for the proposal is 



f p(y\gP^,S)p(gP) \ 
\ ' p(y\g c ,<P,5)p(g c )j 



(14) 



because q g {e c i j\e p i -,e c _ i j)/q g {e p i -\e'lpe c _ ij ) = 1. The ratio p(g p )/p(g c ) is known and the ratio 
of marginal likelihoods 

p(y\gP,*,6) = %M,$)% C , £*,$*) 
p(y\g c ,^,5) h(g c ,5,$) h{gP,6*,$*)' 

A simple expression for (15) is derived in Appendix B. 



4.2. Generating the parameters in $ 

In all cases of Section 3.3 we generate r using a random walk MH method 

log(rP) = log(r c ) + £ r , ~ JV(0,<7?), 
which has acceptance probability 

p(y\g,r p ,p) p(t p ) \ 
' p(y\g,r c ,p) p(t c ) J 

as the proposal densities cancel out. In the equicorrelated case, the parameter p is generated 
similarly to r by a random walk MH method 

P P = P C + £, P , Z p ~N(0,a 2 p ). 

The choice of the variances a 2 , a 2 is sensitive to p, and was fine tuned to attain acceptance 
probabilities of around 25% according to the acceptance rate of the proposals. For the case 
p = 17 reported in this paper, such an acceptance probability resulted from using a 2 = 1/10 
and a 2 = 1/20. 



4.3. Generating E, il and p 

Although p, S and fl are not generated in the MCMC simulation, it is often necessary to 
estimate functionals of p, £ and O. Such functionals can be estimated by sampling from 
the posterior distribution of E, il and p. Conditional on (g, 5, $) it follows from Theorem 1 
that p(T,\y,g,5, <&) is HIW (6 + n — 1,$ + S^) so that E and ft can be generated using 
Theorems 3 and 4 of Roverato (2000). It is straightforward to show that p(p\y, E, <7, 5, $) is 
Af(y, E/n), and hence to generate p, giving iterates {p^\ fi^'l , j > 1} from the posterior 
distribution. 



4.4. Efficient estimation ofE(Q\y) 

The posterior mean of f2 is not only used as an estimator of fi, but also of £ because 
i?(0|2/) _1 is the Bayes estimator of £ for the L\ loss function in Section 5. One method 
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of estimating E(Q\y) is to use the histogram estimator J -1 Y^j=i A statistically more 
efficient estimator is the mixture estimator J -1 X^f=i ^(^\Ui 9^ > <^> $^)- 
We now show how to efficiently compute E(£l\y, g, S, $) using the following notation from 
Lauritzen (1996). Supp 
p x p matrix defined by 



Lauritzen (1996). Suppose that A is a p x p matrix and S C V. Let B = [Ass^ be the 



A i:j tt{i,j}cS 



13 y otherwise 

Theorem 3. Suppose that Q\y <~ W(g,S*,$*), where g is decomposable. Then, using the 
notation of this and Section 3, 



£;(%,5,$, 5 ) = ^[(r + iai-i)($c i c i )" 1 ] -£[(<** + isi - 1) *] • 

i=l i=2 

(17) 

Proof. See Appendix A. 



5. Comparison of the size prior for a graph with the uniform prior 

This section compares the prior based on the graph size with the uniform prior that is used 
in most previous articles. Performance is in terms of a loss function and a simulation was 
carried out to numerically assess performance. We found that overall the size based prior 
for g outperformed the uniform prior. 

Our simulation considered the following five graph types for g. (a) Q — I, the identity 
matrix, representing the empty graph and a diagonal covariance matrix; (b) f2 tridiagonal, 
representing a sparse and decomposable graph (this is a chain graph with p — 1 edges); 
(c) £1 an 'extreme' full matrix (the correlation coefficients pij of f2 _1 satisfy \pij\ > .30), 
which is a complete graph; (d) O corresponding to a 4-cycle on p vertices representing a 
sparse but nondecomposable graph; and (e) ft corresponding to a p— cycle on p vertices, 
again representing a sparse but nondecomposable graph. We note that the nondecomposable 
graphs in (d) and (e) require the addition of extra edges when we estimate them by a mixture 
of decomposable graphs. Furthermore, (e) is an extreme case of non-decomposability, as it 
requires a minimum of p — 3 fill ins. Conversely, the unchorded 4-cycle on p nodes requires 
the fewest number of fill ins, so was chosen as an indicator of performance for the sparsest 
nondecomposable case. 

The simulation considered the three forms of <& described in Section 3.3 and two sample 
sizes n = 40 and n = 100. We report results for matrices of size p = 1 7, but similar results 
were obtained for matrices of other sizes. 

Let S T be the true value of £ and let E be an estimator of S T . We measure the performance 
of E using the Li loss function 

ii(E, E T ) = tracc(EE T 1 ) - logdct(EE T 1 ) - p. (18) 

This loss function is frequently used to compare estimates of the covariance matrix, e.g. Yang 
and Berger (1994). It is straightforward to show that L\ > for all E and Ey, and that it 
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is only equal to if E = St- It is also straightforward to show that for y ~ -/V(0, E), 

St) = - / p(y|E) to s(^jjf^) ^ ( 19 ) 

i.e. L\ is equivalent to a Kullback-Liebler distance between St) andp(y|E) with respect 
to the density p(y|E). The Bayes estimator for S for the L\ loss function is -^(fily)" 1 , which 
can be computed as in Section 4.4. 

We use boxplots to compare replication by replication the size based prior with the uniform 
prior in terms of the percentage increase in the loss function L\ resulting from using the 
uniform prior compared to the size-based prior; i.e. the boxplots are based on calculating 

10Q r L unif _ L sizey L size 

for each replication, where L\ mf and Lf ze are the values of Li(S, St) for the uniform and 
size based priors respectively. 

The boxplots are based on 20 replications with each replication consisting of 2,000 burnin 
iterations and 20,000 sampling iterations. We ran the sampler for the case p = 17 on n = 40 
and 100 observations from five simulated data sets corresponding to the five models (a)-(e) 
for n. 

Figure 1 presents the results for p = 17. The plots show that for <& = tI and <& equicorre- 
lated, the size prior is at least as good, and often much better than, the uniform prior. For 
$ = TS y /(n — 1), the comparison between the size prior and the uniform prior is inconclu- 
sive for n = 40, but for n = 100 the size prior is at least as good as, and often better than 
the uniform prior. We conclude that the size based prior outperforms the uniform prior. 

We also compared the performance of the three forms of <& for the uniform and size priors 
and found that overall the equicorrclated form of $ using the size based prior for the graph 
performed best, and it is this combination that we use for the rest of the paper. 



6. Evaluating the size based prior 

To use the size based prior for graphs on p vertices, we need the set of numbers {A Py k '■ 
k = 0, . . . , r} where A p _ k is the number of decomposable graphs of size k on p vertices, and 
r = (j) is the maximum graph size. These numbers are not in the literature, nor is there a 
general method available for computing them. In this section we present some exact values 
of A Py k as well as a simulation method that can estimate the A p ^ as precisely as necessary. 

Let _B Pi fc be the number of connected decomposable graphs of size k on p vertices. Equa- 
tions (3) and (4) of Castelo and Wormald (2001) give recurrences to calculate A p ^ from 
the B Py k analytically, and the information to calculate all B Pt k analytically is implicit in 
Wormald (1985). For p < 8, Wormald (1985) gives the B p ^ from which we computed the 
A Pt k and these are reported in Table 1. 

However, Wormald's (1985) analytic approach for obtaining the B p ^ is likely to be com- 
putationally intractable for p > 25 (private correspondence with Wormald) and even for 
8 < p < 25 obtaining the B Py k would take weeks on realistically sized computers. Further- 
more, analytically deriving the A Py k from the B Py k is computationally feasible only for small 
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Table 1. For each p, 2 < p < 8 the table gives each A p>k , < k < r and 
A p = J2l=o A p> k - Tne taDle also 9 ives for eacn P tne percentage of graphs that are 
decomposable. 
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Fig. 1 . Percentage increase in loss of uniform prior relative to the size prior measured under Li loss. 
The left panels correspond to n — 40 and the right panels to n = 100. taul, equi and tauS correspond 
to $ = tI, $ equicorrelated and $ = TS y /(n - 1). 



p. Because of these difficulties we propose a simulation methodology to estimate the A p ,k 
for all p. 



6. 1. Methodology 

We begin with some exact results which can be used to calculate {A p j~ ■ k < 5 and r — 2 < 
k < r} analytically for any p. Let F p ^ denote the number of nondecomposable graphs 
having p vertices and k edges. 

Lemma 4. (a) A p . k = Q - F pM . 

(b) F P . Q = F Ptl = F Ptr = 0, p > 0. 

(c) F p . 2 = F p . r _! = 0, p > 2. 

(d) F p , 3 = 0,p>3. 



Proof. The proof is obvious. 
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Lemma 5. (a) For p > 4, F pA = (£) x 3. 

(b) Forp>4, F p , r _ 2 = F Pt J 

(c) Forp> 5, F p<5 = (I) x 12 + (?) x 3 x (r - 6). 



PROOF. See Appendix A. 



We now show how to estimate the {A p ^ ■ 6 < k < r — 3} for all p. Our approach is to run 
a separate simulation to estimate each A Pt k for 6 < k < r — 3. The simulations are done in 
ascending order of fc, i.e. k = 6, . . . , r — 3, and the simulation to estimate a particular A p ,k 
is restricted to graphs of size < k and uses the estimates A p j of A p j for j = 6, • • • , k — 1 
that have been calculated in previous simulations. 

We now describe the details of the simulation to estimate a particular A p .k- Let <f) p .k be the 
initial estimate of A p .k given by 

A 2 

4> P ,k = 5p, fc J' -1 (20) 

^4p,fe-2 

with 5 Pi fc chosen in the range (0.5, 1). To justify this choice of <j) p ,k, we note that we have 
found empirically that \ogA P} k is approximately a negative quadratic (see figures 2 and 3) 
so that \ogA Pt k — 21ogA Pi fc_i + \ogA Pt k-2 < 0, and hence 

A p k/Ap,k-i 
a P,k = j I a ^ L 



We have also found empirically that a Pi k is likely to exceed 0.5. 
As 



A 2 

P,k-1 



A p .k-2 

the above discussion suggests the choice of 4> p ^ in (20). Further details on the choice of 
5 Pi fe can be obtained from the authors. 

We use Lemmas 4 and 5, the estimates A p j of A p j for j = 6, • • • , k — 1 that have been 
calculated in previous simulations, and the initial estimate 4> Pt k of A p ^ given above to define 
the following probability distribution p e (g) on the graphs g of size < k. To simplify the 
notation we omit subscripts for p and k in p e (g)- 



Pe(g) OC < 



which implies that 



<t>P,k 



if < size(g) < 5 
if 6 < sizc(g) < k — 1 
if size(g) = k 

A Py k/ 4>p,k 



(21) 



p e (size = k) 

P^ize < 5) _ E 5 1=0 A M /A P 



— ^A Pt k/ (j>p,k 
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and hence 

p e (size = k) 
Ap ^ 6 ^ k p e ( Sl ze<5Y 

By running the simulation described below based on p e (g) we can estimate the ratio 
p e (size — k)/p e (size < 5) by their relative frequencies and hence obtain an estimate of 

2 a Pejsize = k) 
p e \size < 5) 

where p e (size = k) and p e (size < 5) are the empirical relative frequencies. 

The simulation uses the following MCMC sampling scheme. As in Section 4.1, we generate 
the edge indicators one at a time conditional on the other edge indicators. Let g c — (V, E c ) 
be the current graph with edge indicators given by {e-ki ■ (k,l) € E c }. We select an edge 
at random. If g = (e^, el fj -) corresponds to a decomposable graph of size < k for 
both eij = and = 1 then we proceed, where we again use the legal edge addition 
and deletion characterizations of Giudici and Green (1999) and Frydenberg and Lauritzen 
(1989) respectively to test this. Otherwise we select a new edge. If we proceed, then we 
propose a new graph g p = (1 — e?., eij ■) and accept this graph with probability 

mm{l,p e (gP)/p e (g c )} 

which is evaluated using (21). 

We note that at each stage we can also re-estimate A p j, j = 6, • • • , k — 1. 



6.2. Results 

This section presents the estimates A p ^ for k = • • • r and p — 8 and 34, and provides a 
general method to check on the quality of these estimates. Define the prior p e (g) on the 
decomposable graphs g as 

{■j — if < size(g) < 5 or r — 2 < size(g) < r 

P, "" ( " if 6 < size( ff ) < r - 3. 

The prior p e in this section is different to p e in Section 6.1. If the estimates A p ,k, 6 < k < 
r — 3 are precise, then p e (size = k) should be close to uniform and hence close to the target 
value l/(r + 1). An approximate lower bound for the standard error of the estimates of 
p e (size — k) is \J-n{l — ir)/J, where n = l/(r + 1) and J is the number of iterates used 
to compute p e (size — k). Our simulations use a burnin period of 2,000 iterations and a 
sampling period of N = 10,000 iterations. Figure 2 plots the estimates A p ^ for p = 8 
and the true values A^_k, k — 0- • -r on both an absolute and logarithmic scale. Figure 2 
also plots the estimates of p e (size — k) together with the target value l/(r + 1) and lower 
bounds for the ±3 standard error lines. 

Figure 3 has the same interpretation as Figure 2 but is for p = 34. The true values of A^^ 
are not plotted as they are mostly unknown. 

For p = 9, • • • , 12 the totals A p — A p j are known, but not the A p j. As a further 

check on results we compared our estimated values of A. p to A p and found that we were 
consistently within 1% of the truth. 
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(a) (b) (c) 



Fig. 2. Panel (a): Plot of true A 8 , k (-) and estimates A 8 , k (open circles), k = 0, • • • r. Panel (b): Log 
scale of plot (a). Panel (c): Plot of p e (size = k) together with their target value of l/(r + 1) (middle 
horizontal line) and ±3 approximate standard errors (outer horizontal lines). 




200 400 200 400 ' 200 400 

(a) (b) (c) 

Fig. 3. Panel (a): Plot of estimates Az iik k = 0, • • • r. Panel (b): Log scale of plot (a). Panel (c): 
Plot of p e (size = k) together with their target value of l/(r + 1) (middle horizontal line) and ±3 
approximate standard errors (outer horizontal lines). 



7. Comparsion of sampler efficiency to reversible jump approaches 

This section compares the efficiency of our sampler to the reversible jump approaches de- 
scribed in Brooks et al. (2003) using the six dimensional fowl bones dataset (Whittaker, 
f990). To conform with the results given in Brooks ct al. (2003) we use the equicorrelated 
form of <E> and the uniform prior with a simulation run length of 1 million thinned to every 
10th to give 100000 generated graphs. 

The plot of the number of edges in the generated graphs given in Panel (a) of Figure 4 can 
be compared to Figure 2 of Brooks et al. (2003). This plot shows that our sampler has 
much less dependence than the best performing approach in Brooks et al. (2003) which is 
the correlated AV method. The plot of the cumulative number of graphs visited given in 
Panel (b) of Figure 4 can be compared to Figure 3 of Brooks et al. (2003). This plot shows 
that we visit 315 different graphs after 100 000 generated graphs in 1 million iterates. This 
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compares with 245 different graphs visited for the best performing method in Brooks et al. 
(2003) which is the correlated AV method. Note that our sampler reaches the cumulative 
total of approximately 250 graphs in the first 20 000 generated graphs which corresponds 
to the first 200 000 iterates. 




U I I I I 

200,000 400,000 600,000 800,000 1,000,000 
Number of iterations 
(b) 



Fig. 4. Plots for the fowl bones dataset using the equicorrelated form of $ and the uniform prior. 
Note that there are 100 000 generated graphs, however, the horizontal axes run from 1 to 1 million 
iterates to reflect the thinning process as described in the main text. Panel (a) is the number of edges 
in each generated graph. Panel (b) is the cumulative number of graphs visited during the simulation. 
The total number visited is 31 5. 

A numerical comparison between the methods is given by the effective sample size (ESS) 
(Kass et al., 1998) for the thinned sample of 100 000 iterates of the number of edges plotted 
in Panel (a) of Figure 4. The ESS for our method is 46 891, which is over 30 times larger 
than the best performing method in Brooks et al. (2003) (the correlated AV method) which 
has an ESS value of 1403. Note that our ESS value is approximately 50% of the maximum 
value of 100 000 for an independent sample. 



8. Physical measurements data 

In this section we illustrate our methods on a dataset consisting of the weight and various 
physical measurements described in Larner (1996) on 22 male subjects aged 16 to 30. The 
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subjects were randomly chosen volunteers and were all in reasonably good health. They were 
requested to slightly tense each muscle being measured to ensure measurement consistency. 
Apart from Moss, all measurements are in cm. 

The p = 11 variables are indexed in the following order: 

(1) Mass: weight in kg, 

(2) Fore: maximum circumference of forearm, 

(3) Bicep: maximum circumference of bicep, 

(4) Chest: distance around chest directly under the armpits, 

(5) Neck: distance around neck, approximately halfway up, 

(6) Shoulders: distance around shoulders, measured around the peak of the shoulder blades 

(7) Waist: distance around waist, approximately trouser line, 

(8) Height: from top of head to toe, 

(9) Calf: maximum circumference of calf, 

(10) Thigh: circumference of thigh, measured halfway between the knee and the top of the leg, 

(11) Head: maximum circumference of head. 

Figure 5 summarises the output using the equicorrelated form of <& and the size based prior 
for the graph. Panel (a) gives the estimate of the partial correlation matrix which equivalent 
to with the diagonal entries normalised to one. Panel (b) gives the posterior probabilities 
of each edge being present. Panel (c) gives the graph that results from applying a 70% 
threshold to the values in Panel (b). Note that the procedure to obtain the graph in Panel 
(c) does not guarantee decomposability. 




(a) (b) (c) 



Fig. 5. Plots for the physical measurements dataset using the equicorrelated form of $ and the size 
based prior for the graph. Panel (a) is the image plot of the estimate of the partial correlation matrix. 
Panel (b) is the image plot of J. Panel (c) is the 70% graph. 



9. Comparsion to the Wong et al. (2003) covariance selection prior 

This section compares the performance of the prior in our article to the covariance selection 
prior of Wong et al. (2003), which does not assume that the graph of the covariance matrix 
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is decomposable. Based on the results in Section 5, we use the equicorrclatcd form of $ 
and the size based prior for the decomposable graphs. 

The design of the simulation study is similar to that in Section 5. We use L\ as the loss 
function, p = 17, two sample sizes n = 40 and n — 100, and four graphs for Q: identity, 
tridiagonal, 4-cycle and 17-cycle. 

We refer to the decomposable prior as DCP and the nondccomposable prior of Wong et al. 
(2003) as NDP. Figure 6 reports boxplots of the percentage increase in L\ of DCP over 
NDP for each iterate, i.e. 



100(£[ 



DCP 



L 



NDP 



NDP 



Figure 6 shows that both priors perform similarly for decomposable graphs and nondecom- 
posable graphs, for both n = 40 and n = 100. These results and others suggest that the 
prior based on decomposable graphs performs similarly to that of Wong et al. (2003) when 
the graphs are relatively sparse. 



40 
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-20- b= i- 



-40 



ident tridi 4-cyc 1 7cyc 




4-cyc 17cyc 



Fig. 6. Percentage increase in L x for DCP over NDP. The left panel is for n = 40 and the right is 
for n = 100. 



Next we report autocorrelation plots for the iterates of the elements of f2, when p = 5 and 
the graph is full for both DCP and NDP when n — 40. The simulation for DCP uses a 
burnin of 50,000 iterations and a sampling of 50,000 iterations, and 500,000 burnin and 1 
million sampling iterations for NDP. 

Figures 7 and Figure 8 are the autocorrelation plots for the DCP and NDP models for a 
representative selection of Qij . The figures show that the autocorrelations of the iterates of 
the flij decay rapidly to zero for the DCP model, but are far more dependent in the NDP 
model. This difference in dependence is due to the greater efficiency of the sampling scheme 
in the decomposable case. Grey scale plots of the true inverse covariance fi and posterior 
mean estimates of 51 for the NDP estimator and the DCP estimator for the 17-cycle case 
indicated that NDP and DCP performed similarly in the simulations. For brevity only the 
nondecomposable 17-cycle is presented as it represents a case of high non-decomposability. 
Figure 9 shows that even in this case, the grey scales are very similar. 
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Fig. 7. Autocorrelations of the iterates of the f2 i3 in the DCP case for a representative selection of 
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Fig. 9. True inverse covariance Q and posterior mean estimates of fi for the NDP estimator and the 
DCP estimator for the 1 7-cycle case. 
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A. Proofs of results 



Proof of Theorem 1 

Roverato (2000) shows that if £ ~ HIW(g, 6, $) and Q = E" 1 then 



p(%,<y,$)a|fi 



15 ' , 



(22) 



The result then follows from (7) since 

p{n\y,g,6,*)<xp{y\Sl)p{n\g,6,$) 

oc |r»| ( "- 1)/2 ctr |fi| ( ^ 2)/2 etr 

= |r!|("+ 5 - 3 )/ 2 etr (~n {s y + $; 



Note that the conjugate prior result for does not require the graph g to be decomposable. 

Proof of Theorem 2 
First 

The result then follows from (2), (3), (7) and Theorem 1. 
Proof of Theorem 3 

From Equation (5.23), Lemma 5.5 of Lauritzen (1996) 

n = ^[(s CiCi )- 1 ] y -^[(s SiSi )- 1 

»=i 

and hence 

k 

E (Q\Y, S, $, g ) =J2[ E ((^c.c.r 1 \Y, S, $ 



i=1 



K 



i=2 



Now E|y, 5, $, g ~ HIW (S, <&*,#*), so from Dawid and Lauritzen (1993), if A is a complete 
set in .g then (S^a) 1 |y,£, $,5 ~ Wishart (5* + \ A\ - 1,$\ A ). The result then follows 
from the properties of the Wishart distribution. 

Proof of Lemma 5 



(a) For a nondecomposable graph to have 4 edges it must contain exactly one chordlcss 
4-cycle and no other edges. There are (^) possible choices for the 4 vertices, and for 
each choice of 4 vertices there are 3 different chordless 4-cycles. 

(b) For a graph to be nondecomposable with — 2 edges it must contain exactly one 4 
cycle and all other edges must be present. Then apply the proof of the above. 

(c) We can partition the nondecomposable graphs with 5 edges into 2 sets: (a) those 
with a chordless 5-cycle and no other edges, and (b) those with a chordless 4-cycle 
and an extra edge. For case (a) there are (|) choices for the 5 vertices and for each 
choice there are (5 — l)!/2 = 12 different chordless 5-cycles. For case (b) there are 
(4) x 3 choices for the chordless 4-cycle, and for each choice of chordless 4-cycle there 
are ((?J) — 6) choices for the extra vertex pair constituting the edge. 
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B. HIW results for Bayesian analysis using MCMC 



The following results derive an expression for (15) that can be evaluated efficiently. The 
first theorem gives some necessary graph theory. 

Let g = (V, E) be a decomposable graph with edge indicators {e^ , i < j < p}. Assume the 
edge indicator = 1 for g, and that the graph g' — (V,E') is decomposable and has edge 
set E' as defined by indicators {e- = 0, e_y}. 



Theorem 6. Suppose that g and g' are the decomposable graphs defined above. Suppose 
that Ci, . . . , C'k are the cliques of g ordered to form a perfect sequence and S2, ■ ■ ■ , Sk are 
the corresponding separators. Then 

(a) The edge (i,j) is contained in a single clique of g. 

(b) If G C q then either i £ S q or j ^ S q . 

(c) If j £ S q and C qi = C q \{j} and C q2 = C q \{i} then Cy, . . ., C,_i, C qi , C q2 , C q+1 , . . ., 
Ck is a perfect sequence of complete sets in g' and has separators S2, ■ ■ ■ , S q -i, S qi = S q , 
Sqi = C q \{i,j}, S q+ i, . . ., Sk- 

(d) The sequence C\, . . ., C q -\, C qi , C q2 , C q+ i, . . ., Ck contains all the cliques of g' . 



Proof. Part (a) is Theorem 1 of Frydenberg and Lauritzen (1989) . 

Parts (b) and (c) follow from part (a) and Lemma 2.20 of Lauritzen (1996). 

To show part (d), suppose that C* is a clique of g' . Then C* is complete in g, so C* C Ci 

for some I & {1, ... ,k}. If C* C C q then part (b) implies that either i ^ S q or j £ S q . So 

either C* C C qi or C* C C q . 2 . Hence C* is contained in at least one of C\, . . ., C ? _i, C qi , 

C q21 C q+ i, . . ., Cfe. Part (c) shows that C\, . . ., C q -\, C qi , C q21 C q+ i, . . ., Ck are complete 

sets in g' and the result follows. 



The next lemma uses (4) and Theorem 6 to simplify (15). 



Lemma 7. Suppose that g and g 1 are the decomposable graphs defined above. Then, using 
the notation of (12), and Theorem 6 



%,<?,$) hjg',8*,^ 
h(g',5,$) h(g,6*,f>* 



D-D IS, 



>+|s, 2 | + i 



n\s q 



£*+l£g2l 



ii\S q . 
5+1 S,, 



4> 



30\S q 



DD\S q 



F 



5+ 5„, +1 



s*+\s q2 \ 



(23) 



whereD = {i,j}, $> DD \ Sq2 = $DD-$DS q2 ($S q2 sJ 1 $S q2 D, and<S> iilSq2 , $ jjlSq2 , ®* DD \s q2 
®ii\S q an d ®jj\S g are defined similarly. 
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Proof. To obtain an expression for h(g',S,$) we require the following technical lemma 
based on Lemma 2.13 of Lauritzen (1996). 

Lemma 8. Let C\, . . ., be a perfect sequence with separators S2, ■ ■ ., Assume that 
Ct C C p for some t 7^ p and that p is minimal with this property for fixed t. Then 

(a) If p < t then S t — C t and C\, C t ~\, C t+ \, is a perfect sequence with 
separators S2, ■ ■ ■, St-i, St+i, ■■■>£% 

(b) If p > t then S p = C t and C\, . . ., C t -\, C p , C t +i, ■ ■ ., C v -\, C p+1 , C~ is a perfect 
sequence with separators S2, ■ ■ ■, S t -i, S t , S t +i, ■ ■ ., S p -\, S p+ i, Sj. 



Proof of Lemma 8. See Lemma 2.13 of Lauritzen and its proof. 

From Lemma 8, a perfect sequence of complete sets C\, . . ., containing the cliques of g' 
can be thinned by removing complete sets that are not cliques and reordering the sequence. 
From Lemma 8, the right-hand side of (4) is invariant to this thinning process. Successive 
application of the thinning process gives a perfect sequence consisting of the cliques of g' . 

From (4), Theorem 6 and Lemma 8 
h(g',5,$) 



n 

i=l,...q—l,qi,q2,q+l,---,k 



S+\Cj\-l 



1 \d\ { — 2 — ) 



n 

i=2,...q—l,qi,q2,q+l,...,k 



2 



l \St\ { 2 ) 



(24) 



Now consider the ratio h(g,5,$)/h(g',6,$). Simplifying the expressions from (4) and (24) 
gives 



h(g>,5,$) 



S +\ S q 2 \ + 1 



Sq Sq 



«+\s q2 \ 



\^Cg 1 Cg 1 
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Substituting 
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l $ s, 2 
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into (25) gives 

h(g,6,*) 



DD\S a 



L±i£|2j+1 ^ 



s+\s„ 
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A similar expression can be derived for the ratio h(g, 5* ,$*)/h(g' , 5* ,$*) and the result 
follows. 



The following lemma gives an efficient method for evaluating the terms in (23) using 
Cholesky decompositions. 

Lemma 9. Using the notation of Theorem 6 and Lemma 7, suppose that the matrix Ac q c q > 
is partitioned as 



Ac q c a = 



As q Aq n 
Aos q2 A DD 



and has Cholesky decomposition Ac c„ — LL' where 



and 
Then 

(a) A DD \ Sn = L DD (L DD ) 

(b) A DD]Sq2 = (l aa f (Ippf 

(c) A aa \g g2 = {I act) 

(d) A ms = (l fja ) 2 + {l 0l3 f 



Lq Q 

^DSg 2 L DD 





hp 



Proof. The proof is straightforward and is omitted. 



Equation (15) and parts (b) — (d) of Lemma 9 give an efficient expression for the conditional 
distributions in Section 4. The main computational effort is in updating the Cholesky 
decompositions of the matrices <&c q C q and c whenever an edge is added or deleted. 
From Lemma 9, these Cholesky decomposition must be done with the entries for the ith 
and jth vertices in the lower right corner. Note that efficient Cholesky updating routines 
using Givens rotations are available in Matlab and Fortran. Note also that the dimensions 
of &c q C q and $c q C q depend on the cliques sizes and may be much smaller than p. Thus our 
method has the local computational properties described in Giudici and Green (1999) and 
will have similar computational cost to their method per iteration of the Gibbs sampler. 
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